Series 4 of 10 – Fitting Linear Models - Multiple Regression

Biostatistics R Bayesian Methods JAGS/Stan

Review a simple linear regression using Bayesian Methods
How to fit a linear multiple regression (Stan) using Bayesian Methods
Cases of significant/insignificant, correlated and collinear predictors
Shrinkage
Bayesian model with or without shrinkage, ridge regression and lasso regression: An example of SAT scores

Hai Nguyen
January 31, 2022

Review a simple linear regression in Bayes

See the earlier post on how to fit a simple linear regression with both the Frequentist approach and Bayesian methods. Now we move on to multiple linear regression.

Multiple Regression

Example 1: Two significant predictors

Generate data for multiple linear regression with 2 independent significant predictors.

# generate data
set.seed(03182021)
Ntotal <- 500
x <- cbind(rnorm(Ntotal, mean = 20, sd = 4), 
           rnorm(Ntotal, mean=10, sd = 6))
Nx <- ncol(x)
y <- 4 + 1.1*x[,1] + 3*x[,2] + rnorm(Ntotal, mean = 0, sd = 1)

Create a data list.

dataListRegression <- list(Ntotal = Ntotal, y = y, x = as.matrix(x), Nx = Nx)

Here is the model fit by the Frequentist method:

summary(lm(y~x))

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.83480 -0.63696  0.00131  0.69896  2.48803 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.565419   0.230456   19.81   <2e-16 ***
x1          1.074206   0.010945   98.14   <2e-16 ***
x2          2.995385   0.007954  376.57   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9759 on 497 degrees of freedom
Multiple R-squared:  0.9969,    Adjusted R-squared:  0.9968 
F-statistic: 7.895e+04 on 2 and 497 DF,  p-value: < 2.2e-16

The diagram of the model shows a hierarchical structure with normal priors for the intercept \(\beta_0\) and slopes \(\beta_i\), \(i=1,2\).

Description of the model:

\[y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \epsilon_i \\ y_i \sim t(\mu_i, \sigma, \nu) \\ \\ \text{where mean, scale and degrees of freedom, respectively, would be} \\ \mu_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} \\ \beta_0 \sim N(M_0, S_0) \ ; \ \beta_j \sim N(M_1, S_1) \\ \sigma \sim Unif(L, H) \\ \nu = \gamma^{\prime} + 1, \ \text{where} \ \gamma^{\prime} \sim exp(\lambda)\]

Now we come back to the problem. Here we write the model string:

modelString<-"
data {
    int<lower=1> Ntotal;
    int<lower=1> Nx;
    vector[Ntotal] y;
    matrix[Ntotal, Nx] x;
}
transformed data {
    real meanY;
    real sdY;
    vector[Ntotal] zy; // normalized
    vector[Nx] meanX;
    vector[Nx] sdX;
    matrix[Ntotal, Nx] zx; // normalized
    
    meanY = mean(y);
    sdY = sd(y);
    zy = (y - meanY) / sdY;
    for (j in 1:Nx) {
        meanX[j] = mean(x[,j]);
        sdX[j] = sd(x[,j]);
        for ( i in 1:Ntotal ) {
            zx[i,j] = (x[i,j] - meanX[j]) / sdX[j];
        }
    }
}
parameters {
    real zbeta0;
    vector[Nx] zbeta;
    real<lower=0> nu;
    real<lower=0> zsigma;
}
transformed parameters{
    vector[Ntotal] zy_hat;
    zy_hat = zbeta0 + zx * zbeta;
}
model {
    zbeta0 ~ normal(0, 2);
    zbeta  ~ normal(0, 2);
    nu ~ exponential(1/30.0);
    zsigma ~ uniform(1.0E-5 , 1.0E+1);
    zy ~ student_t(1+nu, zy_hat, zsigma);
}
generated quantities { 
    // Transform to original scale:
    real beta0; 
    vector[Nx] beta;
    real sigma;
    // .* and ./ are element-wise product and divide
    beta0 = zbeta0*sdY  + meanY - sdY * sum( zbeta .* meanX ./ sdX );
    beta = sdY * ( zbeta ./ sdX );
    sigma = zsigma * sdY;
} "
RobustMultipleRegressionDso <- stan_model(model_code=modelString)

If a saved DSO is available, load it instead of recompiling; then run the chains.

# saveRDS(RobustMultipleRegressionDso, file="data/DSORobustMultRegr.Rds")
RobustMultipleRegressionDso <- readRDS("data/DSORobustMultRegr.Rds")

Fit the model.

fit1 <- sampling(RobustMultipleRegressionDso,
                 data=dataListRegression,
                 pars=c('beta0', 'beta', 'nu', 'sigma'),
                 iter = 5000, chains = 2, cores = 2)
stan_ac(fit1)

stan_trace(fit1)

Look at the results.

summary(fit1)$summary[,c(1,3,4,5,8,9)] #mean, sd, 2.5%, 25%, 97.5%, n_eff
                mean           sd         2.5%         25%
beta0      4.5780310  0.231357219    4.1304172    4.422841
beta[1]    1.0738282  0.011006048    1.0521979    1.066428
beta[2]    2.9951527  0.008047017    2.9797451    2.989759
nu        61.7987150 36.009712586   17.0095629   36.082778
sigma      0.9611543  0.032896641    0.8978996    0.938909
lp__    1014.2033841  1.585209650 1010.1168812 1013.426674
              97.5%    n_eff
beta0      5.025614 7033.830
beta[1]    1.095054 6561.254
beta[2]    3.011046 6871.816
nu       151.390448 4557.868
sigma      1.027262 4328.823
lp__    1016.293483 2600.401
pairs(fit1,pars=c("beta0","beta[1]","beta[2]"))

plot(fit1,pars="nu")

plot(fit1,pars="sigma")

plot(fit1,pars="beta0")

plot(fit1,pars="beta[1]")

plot(fit1,pars="beta[2]")

Analyze fitted model using shinystan

launch_shinystan(fit1)

Conclusions:

  1. \(\nu\) (the degrees of freedom of the t-distribution) is large enough to consider the distribution approximately normal: the 2.5% HDI limit is about 17.0 and the posterior mean is about 61.8. Not surprising: we simulated a normal model, so a large normality parameter \(\nu\) is expected.
  2. Parameters \(\beta_0\) and \(\beta_1\) are significantly negatively correlated, as expected (see the quick check after this list).
  3. Parameters \(\beta_0\) and \(\beta_2\) are also negatively correlated, but the correlation is not as strong.
  4. All parameter estimates are close to the values we simulated.
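
The correlations mentioned in conclusions 2 and 3 can be verified directly from the posterior draws. A minimal sketch (the draws1 name is ours), using the draws matrix of fit1:

draws1 <- as.matrix(fit1)   # posterior draws, one column per parameter
round(cor(draws1[, c("beta0", "beta[1]", "beta[2]")]), 3)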

Reminder of Stan

A Stan program has three required “blocks”:

  1. data block: where you declare the data types, their dimensions, any restrictions (i.e. upper = or lower = , which act as checks for Stan), and their names. Any names you give to your Stan program will also be the names used in other blocks.

  2. parameters block: This is where you indicate the parameters you want to model, their dimensions, restrictions, and name. For a linear regression, we will want to model the intercept, any slopes, and the standard deviation of the errors around the regression line.

  3. model block: This is where you include any sampling statements, including the “likelihood” (model) you are using. The model block is where you indicate any prior distributions you want to include for your parameters. If no prior is defined, Stan uses default priors with the specifications uniform(-infinity, +infinity). You can restrict priors using upper or lower when declaring the parameters (i.e. <lower = 0> to make sure a parameter is positive). You can find more information about prior specification here.

Sampling is indicated by the ~ symbol, and Stan already includes many common distributions as vectorized functions. You can check out the manual for a comprehensive list and more information on the optional blocks you could include in your Stan model.
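
As a minimal sketch of these three required blocks (the variable names here are illustrative, not taken from the model above), a simple linear regression could be written as:

modelStringMinimal <- "
data {
    int<lower=1> N;       // sample size
    vector[N] x;          // predictor
    vector[N] y;          // outcome
}
parameters {
    real beta0;           // intercept
    real beta1;           // slope
    real<lower=0> sigma;  // residual standard deviation
}
model {
    // no explicit priors: Stan falls back on flat priors over the declared ranges
    y ~ normal(beta0 + beta1 * x, sigma);  // vectorized sampling statement
}
"

It would be compiled the same way as above, e.g. stan_model(model_code = modelStringMinimal).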

There are also four optional blocks: functions, transformed data, transformed parameters, and generated quantities.

Objects declared in the “transformed parameters” block of a Stan program are:

  1. Unknown but are known given the values of the objects in the parameters block
  2. Saved in the output and hence should be of interest to the researcher
  3. Usually the arguments to the log-likelihood function that is evaluated in the model block, although in hierarchical models the line between the prior and the likelihood can be drawn in multiple ways

(if the third point is not the case, the object should usually be declared in the generated quantities block of a Stan program)

The purpose of declaring such things in the transformed parameters block rather than the parameters block is often to obtain more efficient sampling from the posterior distribution. If there is a posterior PDF \(f(\theta \mid \text{data})\), then for any bijective transformation from \(\alpha\) to \(\theta\), the posterior PDF of \(\alpha\) is simply \(f(\theta(\alpha) \mid \text{data})\,|J|\), where \(|J|\) is the absolute value of the determinant of the Jacobian matrix of the transformation from \(\alpha\) to \(\theta\). Thus, you can make the same inferences about (functions of) \(\theta\) either by drawing from the posterior whose PDF is \(f(\theta \mid \text{data})\), where \(\theta\) are the parameters, or from the posterior whose PDF is \(f(\theta(\alpha) \mid \text{data})\,|J|\), where \(\alpha\) are parameters and \(\theta\) are transformed parameters. Since the posterior inferences about (functions of) \(\theta\) are the same, you are free to choose a transformation that enhances the efficiency of the sampling by making \(\alpha\) less correlated, unit scaled, more Gaussian, etc. than \(\theta\).
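
As a hedged sketch of this idea (the names below are illustrative and not part of the models in this post): if a location parameter is expected to be far from zero and on a wide scale, the sampler can work with a unit-scaled parameter instead, and the original-scale quantity can be reconstructed in transformed parameters.

modelStringReparam <- "
data {
    int<lower=1> N;
    vector[N] y;
}
parameters {
    real alpha;               // unit-scaled parameter the sampler explores
    real<lower=0> sigma;
}
transformed parameters {
    real mu;
    mu = 100 + 15 * alpha;    // parameter on the original scale, saved in the output
}
model {
    alpha ~ normal(0, 1);     // prior placed directly on alpha, so no Jacobian adjustment is needed
    y ~ normal(mu, sigma);
}
"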

\(\Rightarrow\) Transformation to improve fit. Comments are indicated by // in Stan. The write("model code", "file_name") construct allows us to write the Stan model in our R script and output the file to the working directory (or you can set a different file path).

Example 2: Insignificant predictor

Regression.Data <- as.matrix(read.csv("data/DtSim4RegANOVA.csv", header=TRUE, sep=","))
tail(Regression.Data)
          Output     Input1      Input2
[495,] 2.4054442  0.9276934  0.07278244
[496,] 1.8663026 -0.3678520  1.51715986
[497,] 1.3590146  0.5369795  0.96209003
[498,] 3.1836007  1.0171332 -0.56660564
[499,] 2.3615061  1.1637966  0.07815352
[500,] 0.8483407  1.1775607  1.59720356

Prepare the data for Stan.

Ntotal <- nrow(Regression.Data)
x <- Regression.Data[ ,2:3]
tail(x)
           Input1      Input2
[495,]  0.9276934  0.07278244
[496,] -0.3678520  1.51715986
[497,]  0.5369795  0.96209003
[498,]  1.0171332 -0.56660564
[499,]  1.1637966  0.07815352
[500,]  1.1775607  1.59720356
Nx <- ncol(x)
y <- Regression.Data[ ,1]
dataListInsig <- list(Ntotal=Ntotal, 
                      y=y, 
                      x=as.matrix(x), 
                      Nx=Nx)

Run MCMC using the same DSO.

fit2 <- sampling(RobustMultipleRegressionDso, data=dataListInsig,
                 pars=c('beta0', 'beta', 'nu', 'sigma'),
                 iter=5000, chains = 2, cores = 2)
launch_shinystan(fit2)

Analyze the results.

summary(fit2)$summary[,c(1,3,4,8,9)] #mean, sd, 2.5%, 97.5%, n_eff
                 mean          sd          2.5%         97.5%
beta0    1.217597e+00  0.05182680    1.11612951    1.31636194
beta[1]  7.999726e-01  0.02812159    0.74453626    0.85437808
beta[2]  9.416795e-03  0.02736786   -0.04397004    0.06323134
nu       5.276307e+01 34.06776500   13.74259262  142.96095320
sigma    5.979875e-01  0.02144590    0.55558445    0.64151495
lp__    -1.817471e+02  1.53474481 -185.40442023 -179.67956203
           n_eff
beta0   5838.281
beta[1] 6158.199
beta[2] 5787.610
nu      4567.018
sigma   4814.577
lp__    2650.410
pairs(fit2,pars=c("beta0","beta[1]","beta[2]"))

plot(fit2,pars="nu")

plot(fit2,pars="sigma")

plot(fit2,pars="beta0")

plot(fit2,pars="beta[1]")

plot(fit2,pars="beta[2]")

We see that parameter \(\beta_2\) is not significant.
However, there is no strong correlation or redundancy between the predictors.

Compare with the output of linear model

pairs(Regression.Data)

summary(lm(Output~., data=as.data.frame(Regression.Data)))

Call:
lm(formula = Output ~ ., data = as.data.frame(Regression.Data))

Residuals:
     Min       1Q   Median       3Q      Max 
-1.76631 -0.39358 -0.01411  0.40432  1.91861 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.21480    0.05179  23.458   <2e-16 ***
Input1       0.80116    0.02819  28.423   <2e-16 ***
Input2       0.00970    0.02787   0.348    0.728    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6107 on 497 degrees of freedom
Multiple R-squared:  0.6204,    Adjusted R-squared:  0.6188 
F-statistic: 406.1 on 2 and 497 DF,  p-value: < 2.2e-16

Example 3: Correlated predictors

Strong correlation

set.seed(03192021)
Ntotal <- 500
x1 <- rnorm(Ntotal, mean = 20, sd = 4)
x2 <- 1 - 1.5*x1 + rnorm(Ntotal, mean=0, sd = .1)
x <- cbind(x1,x2)      

plot(x)

Nx <- ncol(x)
y <- 4 + .2*x[,1] + 3*x[,2]+rnorm(Ntotal, mean = 0, sd = 1)
plot(x[,1],y)

plot(x[,2],y)

fitlm<-lm(y~x[,1]+x[,2])
summary(fitlm)

Call:
lm(formula = y ~ x[, 1] + x[, 2])

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7765 -0.7585  0.0172  0.6577  3.1721 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.5619     0.4878   9.353  < 2e-16 ***
x[, 1]       -0.5468     0.6707  -0.815    0.415    
x[, 2]        2.5034     0.4479   5.589 3.76e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9988 on 497 degrees of freedom
Multiple R-squared:  0.9962,    Adjusted R-squared:  0.9962 
F-statistic: 6.53e+04 on 2 and 497 DF,  p-value: < 2.2e-16
drop1(fitlm)
Single term deletions

Model:
y ~ x[, 1] + x[, 2]
       Df Sum of Sq    RSS     AIC
<none>              495.85  1.8338
x[, 1]  1    0.6632 496.51  0.5021
x[, 2]  1   31.1702 527.02 30.3165
dataListShrink2 <- list(Ntotal=Ntotal, y=y, x=as.matrix(x), Nx=Nx)

Note that the actual coefficient for x[,1] is \(+0.2\), but the slope on the plot plot(x[,1],y) is negative.
Also note that the estimated model coefficients differ from the actual values because of the correlation:

cbind(actual=c(4,.2,3),estimated=fitlm$coefficients)
            actual  estimated
(Intercept)    4.0  4.5619222
x[, 1]         0.2 -0.5468278
x[, 2]         3.0  2.5034438

Run the chains and analyze the results.

tStart<-proc.time()
fit3<-sampling(RobustMultipleRegressionDso,
               data = dataListShrink2,
               pars = c('beta0', 'beta', 'nu', 'sigma'),
               iter = 5000, chains = 2, cores = 2)
tEnd<-proc.time()
tEnd-tStart
   user  system elapsed 
   0.51    0.39   92.09 

How long did it take to run this MCMC? Why so long?

Monte Carlo methods assume that the samples are independent. This is usually not the case for sequential draws in Markov Chain Monte Carlo (MCMC) sampling, which motivates thinning. In thinning, we keep every T-th sample and throw away the rest. For some types of MCMC (notably, Gibbs sampling), highly correlated variables result in highly correlated samples (i.e., definitely not independent), requiring a large T to compensate. Large values of T mean a large computational cost for each effective sample (i.e., for each sample that is kept).
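
In rstan, thinning is controlled by the thin argument of sampling(). A sketch (the fit3_thinned name is ours) that keeps every 5th draw of the model above:

fit3_thinned <- sampling(RobustMultipleRegressionDso,
                         data = dataListShrink2,
                         pars = c('beta0', 'beta', 'nu', 'sigma'),
                         iter = 5000, chains = 2, cores = 2,
                         thin = 5)                    # keep every 5th draw
summary(fit3_thinned)$summary[, "n_eff"]              # effective sample sizes of the kept draws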

Check convergence in shiny.

launch_shinystan(fit3)
stan_dens(fit3)

stan_ac(fit3, separate_chains = T)

summary(fit3)$summary[,c(1,3,4,8,9)]
               mean          sd        2.5%       97.5%    n_eff
beta0     4.5857595  0.48813790   3.6314232   5.5544965 2114.939
beta[1]  -0.5967437  0.67710628  -1.9314954   0.7388229 1713.891
beta[2]   2.4699613  0.45229741   1.5765319   3.3619226 1712.967
nu       56.0426640 33.61895666  15.8382163 143.5659182 2718.137
sigma     0.9816801  0.03355264   0.9166907   1.0483195 2731.866
lp__    967.3463697  1.56455325 963.5132940 969.3727437 1911.020
pairs(fit3,pars=c("beta0","beta[1]","beta[2]"))

plot(fit3,pars="nu")

plot(fit3,pars="sigma")

plot(fit3,pars="beta0")

plot(fit3,pars="beta[1]")

plot(fit3,pars="beta[2]")

General signs of collinear predictors:
- High correlation between slopes (compensating signs); see the quick check after this list
- Wide posterior distributions for slopes
- Increased autocorrelation for slopes
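
The compensating-sign correlation can be computed directly from the posterior draws of fit3; a minimal sketch:

betaDraws3 <- rstan::extract(fit3, pars = "beta")$beta  # draws matrix: columns are beta[1], beta[2]
cor(betaDraws3[, 1], betaDraws3[, 2])                   # expected to be strongly negative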

pairs(cbind(y,x1,x2))

cbind(actual=c(4,.2,3),estimatedLm=fitlm$coefficients,estimatedBayes=summary(fit3)$summary[1:3,1])
            actual estimatedLm estimatedBayes
(Intercept)    4.0   4.5619222      4.5857595
x[, 1]         0.2  -0.5468278     -0.5967437
x[, 2]         3.0   2.5034438      2.4699613

The linear model shows the same information as the Bayesian model.

Collinearity

When predictors are strongly collinear, the linear model may stop working.
Simulate the same model as in the previous section, but make the predictors collinear.

set.seed(03192021)
Ntotal <- 500
x1 <- rnorm(Ntotal, mean = 20, sd = 4)
x2<-1-1.5*x1+rnorm(Ntotal, mean=0, sd = .000001) # sd close to 0
x<-cbind(x1,x2)           
plot(x)

Nx <- ncol(x)
y <- 4 + .2*x[,1] + 3*x[,2]+rnorm(Ntotal, mean = 0, sd = 1)
plot(x[,1],y)

plot(x[,2],y)

dataListShrink2c <- list(Ntotal=Ntotal, y=y, x=as.matrix(x), Nx=Nx)
(lmFit <- lm(y~x1+x2))

Call:
lm(formula = y ~ x1 + x2)

Coefficients:
(Intercept)           x1           x2  
      7.094       -4.303           NA  
summary(lmFit)

Call:
lm(formula = y ~ x1 + x2)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7626 -0.7636  0.0244  0.6410  3.1747 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.09394    0.24447   29.02   <2e-16 ***
x1          -4.30334    0.01189 -361.96   <2e-16 ***
x2                NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9991 on 498 degrees of freedom
Multiple R-squared:  0.9962,    Adjusted R-squared:  0.9962 
F-statistic: 1.31e+05 on 1 and 498 DF,  p-value: < 2.2e-16
drop1(lmFit)
Single term deletions

Model:
y ~ x1 + x2
       Df Sum of Sq    RSS    AIC
<none>              497.08 1.0687
x1      0 0.0001417 497.08 1.0688
x2      0 0.0000000 497.08 1.0687

The linear model stops working: the coefficient for x2 cannot be estimated (NA).

Simulate Markov chains.


Run the chains and analyze the results.

tStart <- proc.time()
fit3c <- sampling(RobustMultipleRegressionDso,
                data=dataListShrink2c,
                pars=c('beta0', 'beta', 'nu', 'sigma'),
                iter=5000, chains = 1, cores = 2)

SAMPLING FOR MODEL '8e2e8be6f09a2541ae94da0604a417ec' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 5000 [  0%]  (Warmup)
Chain 1: Iteration:  500 / 5000 [ 10%]  (Warmup)
Chain 1: Iteration: 1000 / 5000 [ 20%]  (Warmup)
Chain 1: Iteration: 1500 / 5000 [ 30%]  (Warmup)
Chain 1: Iteration: 2000 / 5000 [ 40%]  (Warmup)
Chain 1: Iteration: 2500 / 5000 [ 50%]  (Warmup)
Chain 1: Iteration: 2501 / 5000 [ 50%]  (Sampling)
Chain 1: Iteration: 3000 / 5000 [ 60%]  (Sampling)
Chain 1: Iteration: 3500 / 5000 [ 70%]  (Sampling)
Chain 1: Iteration: 4000 / 5000 [ 80%]  (Sampling)
Chain 1: Iteration: 4500 / 5000 [ 90%]  (Sampling)
Chain 1: Iteration: 5000 / 5000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 155.086 seconds (Warm-up)
Chain 1:                114.928 seconds (Sampling)
Chain 1:                270.014 seconds (Total)
Chain 1: 
tEnd <- proc.time()
tEnd-tStart
   user  system elapsed 
 261.57    0.17  270.09 

With collinear predictors the model takes much longer to simulate.

stan_dens(fit3c)

stan_ac(fit3c, separate_chains = T)

summary(fit3c)$summary[,c(1,3,4,8,9)]
               mean          sd        2.5%      97.5%    n_eff
beta0     5.2869784  4.12513863  -2.5410297  13.834655 258.6426
beta[1]  -1.5985501  6.17402618 -14.3329219  10.098421 255.8424
beta[2]   1.8031011  4.11591117  -6.6807200   9.601809 255.7672
nu       57.5235574 34.84527395  15.7981673 148.866690 605.1602
sigma     0.9834476  0.03440985   0.9179121   1.052713 331.5993
lp__    967.5976423  1.48319659 964.0943396 969.579397 845.7838
pairs(fit3c,pars=c("beta0","beta[1]","beta[2]"))

plot(fit3c,pars="nu")

plot(fit3c,pars="sigma")

plot(fit3c,pars="beta0")

plot(fit3c,pars="beta[1]")

plot(fit3c,pars="beta[2]")

Markov chains may exceed the maximum tree depth (yellow dots on the pairs graph).
But the Bayesian method still works: it shows that one or both of the slopes are not significantly different from zero.
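
Because x2 is (almost exactly) 1 - 1.5*x1, the data identify only the combinations \(\beta_0+\beta_2\) and \(\beta_1-1.5\beta_2\). A short sketch (the draws3c name is ours) shows that the identified slope combination is estimated sharply even though the individual slopes are not; by the simulation it should concentrate near \(0.2 - 1.5 \times 3 = -4.3\):

draws3c <- rstan::extract(fit3c, pars = "beta")$beta
quantile(draws3c[, 1] - 1.5 * draws3c[, 2], probs = c(0.025, 0.5, 0.975))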

Shrinkage of regression coefficients

When there are many candidate predictors in the model it may be useful to “motivate” them to move closer to zero if they are not very strong.
One way to do it is to replace the normal priors on the slopes with a Student-t prior centered at zero with a small scale \(\sigma\) and a small normality parameter \(\nu\).

Small \(\sigma\) forces slopes to shrink towards zero mean. At the same time, small \(\nu\) makes the tails fat enough to allow some strong slopes to be outliers.

Parameter \(\sigma\) of the prior for regression coefficients \(\beta_j\) can be either fixed, or given its own prior and estimated.

In the former case all coefficients are forced to have the same regularizer. If instead \(\sigma\) is random and estimated from the same data, then there is mutual influence between \(\sigma\) and the regression coefficients: if many of them are close to zero then \(\sigma\) becomes smaller, which in turn pushes the coefficients even closer to zero.

These approaches are reminiscent of ridge, lasso and elastic net regression in machine learning.
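
To see why such a prior acts as a regularizer, compare a small-scale, heavy-tailed t density with a standard normal. A minimal sketch (the scale 0.5 and df 1 are illustrative values, not the ones used in the model below):

# density of a t distribution with scale s is dt(x/s, df)/s
curve(dt(x / 0.5, df = 1) / 0.5, from = -6, to = 6, n = 500,
      xlab = "standardized slope", ylab = "prior density")
curve(dnorm(x, 0, 1), from = -6, to = 6, add = TRUE, lty = 2)
legend("topright", lty = c(1, 2),
       legend = c("t(df = 1, scale = 0.5): shrinkage prior", "normal(0, 1)"))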

Two significant predictors

Use the same data dataListRegression as in the above section.
Describe the model.

\[y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \epsilon_i \\ y_i \sim t(\mu_i, \sigma, \nu) \\ \\ \text{where mean, scale and degrees of freedom, respectively, would be} \\ \mu_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} \\ \beta_0 \sim N(M_0, S_0) \ ; \ \beta_j \sim t(0, \sigma_{\beta}, \nu_{\beta}) \\ \sigma_{\beta} \sim gamma(shape,rate) \\ \sigma \sim Unif(L, H) \\ \nu = \gamma^{\prime} + 1, \ \text{where} \ \gamma^{\prime} \sim exp(\lambda)\]

modelString<-"
data {
    int<lower=1> Ntotal;
    int<lower=1> Nx;
    vector[Ntotal] y;
    matrix[Ntotal, Nx] x;
}
transformed data {
    real meanY;
    real sdY;
    vector[Ntotal] zy; // normalized
    vector[Nx] meanX;
    vector[Nx] sdX;
    matrix[Ntotal, Nx] zx; // normalized
    
    meanY = mean(y);
    sdY = sd(y);
    zy = (y - meanY) / sdY;
    for (j in 1:Nx) {
        meanX[j] = mean(x[,j]);
        sdX[j] = sd(x[,j]);
        for (i in 1:Ntotal) {
            zx[i,j] = (x[i,j] - meanX[j]) / sdX[j];
        }
    }
}
parameters {
    real zbeta0;
    real<lower=0> sigmaBeta;
    vector[Nx] zbeta;
    real<lower=0> nu;
    real<lower=0> zsigma;
}
transformed parameters{
    vector[Ntotal] zy_hat;
    zy_hat = zbeta0 + zx * zbeta;
}
model {
    zbeta0 ~ normal(0, 2);
    sigmaBeta ~ gamma(2.3,1.3); // mode=(alpha-1)/beta, var=alpha/beta^2
    zbeta  ~ student_t(1.0/30.0, 0, sigmaBeta);
    nu ~ exponential(1/30.0);
    zsigma ~ uniform(1.0E-5 , 1.0E+1);
    zy ~ student_t(1+nu, zy_hat, zsigma);
}
generated quantities { 
    // Transform to original scale:
    real beta0; 
    vector[Nx] beta;
    real sigma;
    // .* and ./ are element-wise product and divide
    beta0 = zbeta0*sdY  + meanY - sdY * sum( zbeta .* meanX ./ sdX );
    beta = sdY * ( zbeta ./ sdX );
    sigma = zsigma * sdY;
} "

The gamma prior for sigmaBeta is selected to have a relatively low mode of 1.

xGamma <- seq(from = .00001, to = 10, by = .001)
plot(xGamma,dgamma(xGamma,shape=2.3,rate=1.3),type="l")

xGamma[which.max(dgamma(xGamma,shape=2.3,rate=1.3))]
[1] 1.00001

Create DSO.

RegressionShrinkDso <- stan_model(model_code = modelString)

If a saved DSO is available, load it instead of recompiling; then run the chains.

# save(RegressionShrinkDso, file = "data/DSOShrunkMultRegr.Rds")
load("data/DSOShrunkMultRegr.Rds")

Generate Markov chains in case of 2 significant predictors.

tStart<-proc.time()
# fit model
fit4 <- sampling(RegressionShrinkDso, 
             data=dataListRegression, 
             pars=c('beta0', 'beta', 'nu', 'sigma', 'sigmaBeta'),
             iter=5000, chains = 2, cores = 2
)
tEnd<-proc.time()
tEnd-tStart
   user  system elapsed 
   0.40    0.25   18.60 

Analyze fitted model using shinystan

launch_shinystan(fit4)
stan_dens(fit4)

stan_ac(fit4, separate_chains = T)

summary(fit4)$summary[,c(1,3,4,8,9)]
                  mean           sd         2.5%       97.5%    n_eff
beta0        4.5843612  0.228604011    4.1393299    5.027146 8748.136
beta[1]      1.0735281  0.010973133    1.0524275    1.095341 8200.715
beta[2]      2.9950789  0.007839406    2.9796239    3.009790 7482.511
nu          61.3810270 35.734724611   17.0630928  149.281989 5570.822
sigma        0.9613439  0.033176999    0.8953832    1.028195 5375.540
sigmaBeta    1.3993861  0.899552223    0.2216032    3.632021 6406.796
lp__      1010.3909262  1.749705508 1006.2179766 1012.779996 2263.114
pairs(fit4,pars=c("beta0","beta[1]","beta[2]"))

plot(fit4,pars="nu")

plot(fit4,pars="sigma")

plot(fit4,pars="beta0")

plot(fit4,pars="beta[1]")

plot(fit4,pars="beta[2]")

Analysis and comparison

Compare posterior mean values and 95% HDI with fit1 (same model, but with no shrinkage).

cbind(summary(fit1)$summary[1:3,c(1,4,8)],
      summary(fit4)$summary[1:3,c(1,4,8)])
            mean     2.5%    97.5%     mean     2.5%    97.5%
beta0   4.578031 4.130417 5.025614 4.584361 4.139330 5.027146
beta[1] 1.073828 1.052198 1.095054 1.073528 1.052427 1.095341
beta[2] 2.995153 2.979745 3.011046 2.995079 2.979624 3.009790

Mean values of both fits seem very similar.
Check widths of the HDI for coefficients.

cbind(summary(fit1)$summary[1:3,c(8)]-summary(fit1)$summary[1:3,c(4)],
      summary(fit4)$summary[1:3,c(8)]-summary(fit4)$summary[1:3,c(4)])
              [,1]       [,2]
beta0   0.89519719 0.88781592
beta[1] 0.04285560 0.04291323
beta[2] 0.03130079 0.03016606

Shrinkage is noticeable only after the third digit of the coefficients.
In this example both slopes are significant and they practically did not shrink.

For comparison, fit a linear model, ridge regression and lasso regression to the same data.

1- Linear model.

lmFit<-lm(dataListRegression$y~dataListRegression$x[,1]+dataListRegression$x[,2])

2- Ridge.

library(glmnet)
set.seed(15)
cv.outRidge <- cv.glmnet(x = dataListRegression$x, 
                         y = dataListRegression$y, 
                         alpha=0)
plot(cv.outRidge)

(bestlam <-cv.outRidge$lambda.min)
[1] 1.680471
ridgeFit <- glmnet(x = dataListRegression$x, y=dataListRegression$y,
                   alpha = 0, lambda = bestlam, standardize = F)
(ridge.coef <- predict(ridgeFit,type="coefficients", s = bestlam))
3 x 1 sparse Matrix of class "dgCMatrix"
                  s1
(Intercept) 4.768742
V1          1.068697
V2          2.986144

3- Lasso.

set.seed(15)
cv.outLasso <- cv.glmnet(x = dataListRegression$x, 
                         y = dataListRegression$y, 
                         alpha=1)
plot(cv.outLasso)

(bestlam <-cv.outLasso$lambda.min)
[1] 0.08363741
lassoFit<-glmnet(x=dataListRegression$x,y=dataListRegression$y,
                 alpha=1,lambda=bestlam,standardize = F)
(lasso.coef<-predict(lassoFit,type="coefficients",s=bestlam))
3 x 1 sparse Matrix of class "dgCMatrix"
                  s1
(Intercept) 4.689546
V1          1.069233
V2          2.992895

Compare coefficients from all 5 models

comparison<-cbind(summary(fit1)$summary[1:3,c(1,4,8)],
      summary(fit4)$summary[1:3,c(1,4,8)],
      Ridge=ridge.coef,
      Lasso=lasso.coef,
      Linear=lmFit$coefficients)
colnames(comparison)<-c(paste("NoShrinkage",c("mean","2.5%","97.5%"),sep="_"),
                        paste("Shrinkage",c("mean","2.5%","97.5%"),sep="_"),
                        "Ridge","Lasso","Linear")
t(comparison)
9 x 3 sparse Matrix of class "dgCMatrix"
                     beta0  beta[1]  beta[2]
NoShrinkage_mean  4.578031 1.073828 2.995153
NoShrinkage_2.5%  4.130417 1.052198 2.979745
NoShrinkage_97.5% 5.025614 1.095054 3.011046
Shrinkage_mean    4.584361 1.073528 2.995079
Shrinkage_2.5%    4.139330 1.052427 2.979624
Shrinkage_97.5%   5.027146 1.095341 3.009790
Ridge             4.768742 1.068697 2.986144
Lasso             4.689546 1.069233 2.992895
Linear            4.565419 1.074206 2.995385

All models show practically no shrinkage relative to the linear model.
Both ridge and lasso regressions overestimate the intercept.

Insignificant predictor

Shrink estimates from data dataListInsig.

tStart<-proc.time()
# fit model
fit5 <- sampling (RegressionShrinkDso, 
             data=dataListInsig, 
             pars=c('beta0', 'beta', 'nu', 'sigma', 'sigmaBeta'),
             iter=5000, chains = 2, cores = 2
)
tEnd<-proc.time()
tEnd-tStart
   user  system elapsed 
   0.44    0.17   18.02 

We could analyze the fitted model with shinystan, but here we just check the densities, autocorrelations and the summary.

stan_dens(fit5)

stan_ac(fit5, separate_chains = T)

summary(fit5)$summary[,c(1,3,4,8,9)]
                   mean          sd          2.5%         97.5%
beta0      1.221117e+00  0.05300762    1.11769338    1.32774329
beta[1]    7.988002e-01  0.02885518    0.74227587    0.85594136
beta[2]    7.983679e-03  0.02713646   -0.04570233    0.06256174
nu         5.257628e+01 33.19871103   13.71139258  137.97027648
sigma      5.977651e-01  0.02104643    0.55653680    0.63914284
sigmaBeta  1.043781e+00  0.83037712    0.10282820    3.17391417
lp__      -1.850178e+02  1.72628557 -189.19759519 -182.70019632
             n_eff
beta0     6549.983
beta[1]   6984.701
beta[2]   6777.038
nu        4536.272
sigma     5500.091
sigmaBeta 5994.464
lp__      2157.592
pairs(fit5,pars=c("beta0","beta[1]","beta[2]"))

plot(fit5,pars="nu")

plot(fit5,pars="sigma")

plot(fit5,pars="beta0")

plot(fit5,pars="beta[1]")

plot(fit5,pars="beta[2]")

This time the posterior density of \(\beta_2\) (beta[2]) is concentrated at zero.

Analysis and comparison

Compare mean levels and HDI widths for fits with and without shrinkage.

cbind(summary(fit2)$summary[1:3,c(1,4,8)],
      summary(fit5)$summary[1:3,c(1,4,8)])
               mean        2.5%      97.5%        mean        2.5%
beta0   1.217597421  1.11612951 1.31636194 1.221117106  1.11769338
beta[1] 0.799972585  0.74453626 0.85437808 0.798800184  0.74227587
beta[2] 0.009416795 -0.04397004 0.06323134 0.007983679 -0.04570233
             97.5%
beta0   1.32774329
beta[1] 0.85594136
beta[2] 0.06256174
cbind(summary(fit2)$summary[1:3,c(8)]-summary(fit2)$summary[1:3,c(4)],
      summary(fit5)$summary[1:3,c(8)]-summary(fit5)$summary[1:3,c(4)])
             [,1]      [,2]
beta0   0.2002324 0.2100499
beta[1] 0.1098418 0.1136655
beta[2] 0.1072014 0.1082641

Parameters shrank a little more this time; the second coefficient shrank to zero.
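
The claim about the second coefficient can be checked from the posterior draws of fit5; a short sketch:

betaDraws5 <- rstan::extract(fit5, pars = "beta")$beta
quantile(betaDraws5[, 2], probs = c(0.025, 0.975))  # interval includes zero
mean(betaDraws5[, 2] > 0)                           # tail probability; values near 0.5 indicate no clear sign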

Again, fit a linear model, ridge regression and lasso regression to the same data.

1- Linear model.

lmFit<-lm(dataListInsig$y~dataListInsig$x[,1]+dataListInsig$x[,2])

2- Ridge.

set.seed(15)
cv.outRidge=cv.glmnet(x=dataListInsig$x,y=dataListInsig$y,alpha=0)
plot(cv.outRidge)

(bestlam <-cv.outRidge$lambda.min)
[1] 0.07783229
ridgeFit<-glmnet(x=dataListInsig$x,y=dataListInsig$y,
                 alpha=0,lambda=bestlam,standardize = F)
(ridge.coef<-predict(ridgeFit,type="coefficients",s=bestlam))
3 x 1 sparse Matrix of class "dgCMatrix"
                     s1
(Intercept) 1.287098608
Input1      0.739131424
Input2      0.004155875

3- Lasso.

set.seed(15)
cv.outLasso=cv.glmnet(x=dataListInsig$x,y=dataListInsig$y,alpha=1)
plot(cv.outLasso)

(bestlam <-cv.outLasso$lambda.min)
[1] 0.02067294
lassoFit<-glmnet(x=dataListInsig$x,y=dataListInsig$y,
                 alpha=1,lambda=bestlam,standardize = F)
(lasso.coef<-predict(lassoFit,type="coefficients",s=bestlam))
3 x 1 sparse Matrix of class "dgCMatrix"
                  s1
(Intercept) 1.249387
Input1      0.778466
Input2      .       

Compare coefficients from all 5 models.

comparison<-cbind(summary(fit2)$summary[1:3,c(1)],
      summary(fit5)$summary[1:3,c(1)],
      Ridge=ridge.coef,
      Lasso=lasso.coef,
      Linear=lmFit$coefficients)
colnames(comparison)<-c("NoShrinkage","Shrinkage","Ridge","Lasso","Linear")
t(comparison)
5 x 3 sparse Matrix of class "dgCMatrix"
            (Intercept)    Input1      Input2
NoShrinkage    1.217597 0.7999726 0.009416795
Shrinkage      1.221117 0.7988002 0.007983679
Ridge          1.287099 0.7391314 0.004155875
Lasso          1.249387 0.7784660 .          
Linear         1.214802 0.8011573 0.009700042

All models correctly exclude the second coefficient.
Ridge shrank both slopes more than the other models.
There is again a tendency for ridge and lasso to overestimate the intercept.

Correlated predictors

Shrink coefficients estimated from dataListShrink2.

tStart<-proc.time()
# fit model
fit6 <- sampling (RegressionShrinkDso, 
             data=dataListShrink2, 
             pars=c('beta0', 'beta', 'nu', 'sigma', 'sigmaBeta'),
             iter=5000, chains = 2, cores = 2
)
tEnd<-proc.time()
tEnd-tStart
   user  system elapsed 
   0.54    0.10   91.97 

We could analyze the model with shinystan, but let's check densities, pairs and individual plots of the parameters.

stan_dens(fit6)

stan_ac(fit6, separate_chains = T)

summary(fit6)$summary[,c(1,3,4,8,9)]
                 mean          sd        2.5%       97.5%    n_eff
beta0       4.5437360  0.47198344   3.6607788   5.5095532 1845.144
beta[1]    -0.5261239  0.63908279  -1.8992740   0.6344545 1575.212
beta[2]     2.5172117  0.42658592   1.6035586   3.2916493 1575.367
nu         56.5744375 34.07819768  14.7590894 143.3833364 3428.014
sigma       0.9804926  0.03427887   0.9135965   1.0470397 3347.567
sigmaBeta   1.2388283  0.85960126   0.1607706   3.3968145 3997.294
lp__      963.8218087  1.75910037 959.4784596 966.1987987 2037.995
pairs(fit6,pars=c("beta0","beta[1]","beta[2]"))

plot(fit6,pars="nu")

plot(fit6,pars="sigma")

plot(fit6,pars="beta0")

plot(fit6,pars="beta[1]")

plot(fit6,pars="beta[2]")

Analysis and comparison

Show mean values and HDI.

cbind(summary(fit3)$summary[1:3,c(1,4,8)],
      summary(fit6)$summary[1:3,c(1,4,8)])
              mean      2.5%     97.5%       mean      2.5%     97.5%
beta0    4.5857595  3.631423 5.5544965  4.5437360  3.660779 5.5095532
beta[1] -0.5967437 -1.931495 0.7388229 -0.5261239 -1.899274 0.6344545
beta[2]  2.4699613  1.576532 3.3619226  2.5172117  1.603559 3.2916493
cbind(summary(fit3)$summary[1:3,c(8)]-summary(fit3)$summary[1:3,c(4)],
      summary(fit6)$summary[1:3,c(8)]-summary(fit6)$summary[1:3,c(4)])
            [,1]     [,2]
beta0   1.923073 1.848774
beta[1] 2.670318 2.533728
beta[2] 1.785391 1.688091

In this example \(\beta_1\) shrank more significantly and is not different from zero.
At the same time \(\beta_2\) has become more clearly different from zero.
Regularization reinforced one of the two correlated predictors while suppressing the other.

Again, fit a linear model, ridge regression and lasso regression to the same data.

1- Linear model.

lmFit<-lm(dataListShrink2$y~dataListShrink2$x[,1]+dataListShrink2$x[,2])
summary(lmFit)

Call:
lm(formula = dataListShrink2$y ~ dataListShrink2$x[, 1] + dataListShrink2$x[, 
    2])

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7765 -0.7585  0.0172  0.6577  3.1721 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)              4.5619     0.4878   9.353  < 2e-16 ***
dataListShrink2$x[, 1]  -0.5468     0.6707  -0.815    0.415    
dataListShrink2$x[, 2]   2.5034     0.4479   5.589 3.76e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9988 on 497 degrees of freedom
Multiple R-squared:  0.9962,    Adjusted R-squared:  0.9962 
F-statistic: 6.53e+04 on 2 and 497 DF,  p-value: < 2.2e-16

2- Ridge.

set.seed(15)
cv.outRidge=cv.glmnet(x=dataListShrink2$x,y=dataListShrink2$y,alpha=0)
plot(cv.outRidge)

(bestlam <-cv.outRidge$lambda.min)
[1] 1.61435
ridgeFit<-glmnet(x=dataListShrink2$x,y=dataListShrink2$y,
                 alpha=0,lambda=bestlam,standardize = F)
(ridge.coef<-predict(ridgeFit,type="coefficients",s=bestlam))
3 x 1 sparse Matrix of class "dgCMatrix"
                   s1
(Intercept)  4.943942
x1          -1.425696
x2           1.910632

3- Lasso.

set.seed(15)
cv.outLasso=cv.glmnet(x=dataListShrink2$x,y=dataListShrink2$y,alpha=1)
plot(cv.outLasso)

(bestlam <-cv.outLasso$lambda.min)
[1] 0.1062134
lassoFit<-glmnet(x=dataListShrink2$x,y=dataListShrink2$y,
                 alpha=1,lambda=bestlam,standardize = F)
(lasso.coef<-predict(lassoFit,type="coefficients",s=bestlam))
3 x 1 sparse Matrix of class "dgCMatrix"
                  s1
(Intercept) 4.116010
x1          .       
x2          2.865189

Compare coefficients from all 5 models.

comparison<-cbind(summary(fit3)$summary[1:3,c(1)],
      summary(fit6)$summary[1:3,c(1)],
      Ridge=ridge.coef,
      Lasso=lasso.coef,
      Linear=lmFit$coefficients)
colnames(comparison)<-c("NoShrinkage","Shrinkage","Ridge","Lasso","Linear")
t(comparison)
5 x 3 sparse Matrix of class "dgCMatrix"
            (Intercept)         x1       x2
NoShrinkage    4.585760 -0.5967437 2.469961
Shrinkage      4.543736 -0.5261239 2.517212
Ridge          4.943942 -1.4256958 1.910632
Lasso          4.116010  .         2.865189
Linear         4.561922 -0.5468278 2.503444

All models correctly exclude the first slope.
Lasso does it decisively, making the slope \(\beta_1\) exactly equal to \(0\).
Lasso also estimated the intercept and \(\beta_2\) more accurately than the other models: recall that for this data set we simulated \(\beta_0=4, \beta_2=3\).

Is school financing necessary?

Analysis of SAT scores, example from Kruschke, 2015, section 18.3.

These data are analyzed in the article by Deborah Lynn Guber.

The variables observed are: State, Spend (spending per student, in $1,000s), StuTeaRat (student-to-teacher ratio), Salary (average teacher salary, in $1,000s), PrcntTake (percentage of students taking the SAT), and the SAT scores SATV (verbal), SATM (math) and SATT (total).

Read the data from the file Guber1999data.csv, available with Kruschke (2015).

myData = read.csv("data/Guber1999data.csv")  # section 18.3 @ Kruschke
head(myData)
       State Spend StuTeaRat Salary PrcntTake SATV SATM SATT
1    Alabama 4.405      17.2 31.144         8  491  538 1029
2     Alaska 8.963      17.6 47.951        47  445  489  934
3    Arizona 4.778      19.3 32.175        27  448  496  944
4   Arkansas 4.459      17.1 28.934         6  482  523 1005
5 California 4.992      24.0 41.078        45  417  485  902
6   Colorado 5.443      18.4 34.571        29  462  518  980
pairs(myData[,-c(1,6:7)])

plot(myData$Spend,myData$SATT)

summary(lm(myData$SATT~myData$Spend))$coeff
               Estimate Std. Error   t value     Pr(>|t|)
(Intercept)  1089.29372  44.389950 24.539197 8.168276e-29
myData$Spend  -20.89217   7.328209 -2.850925 6.407965e-03

The plots show that the mean SAT score is negatively correlated with the amount of money states spend per student. These results were used in heated debates about education spending to support the argument in favor of reducing public support for schools.

Prepare the data.

Use the two predictors from the file, plus 12 randomly generated nuisance predictors.

Ntotal <- nrow(myData)
y <- myData$SATT
x <- cbind(myData$Spend, myData$PrcntTake)
colnames(x) <- c("Spend","PrcntTake");
dataList2Predict <- list(Ntotal=Ntotal,y=y,x=x,Nx=ncol(x))
# generate 12 spurious predictors:
set.seed(47405)
NxRand <- 12
for (xIdx in 1:NxRand) {
    xRand = rnorm(Ntotal)
    x = cbind(x, xRand )
    colnames(x)[ncol(x)] = paste0("xRand", xIdx)
}
dataListExtraPredict <- list(Ntotal=Ntotal,y=y,x=x,Nx=ncol(x))

No-shrinkage

Use the same model as in the example of the first section: RobustMultipleRegressionDso.

First, run the model with 2 predictors.

fit_noshrink2Pred <- sampling (RobustMultipleRegressionDso, 
                          data=dataList2Predict, 
                          pars=c('beta0', 'beta', 'nu', 'sigma'),
                          iter=5000, chains = 2, cores = 2)
summary(fit_noshrink2Pred)$summary[,c(1,4,8)]
               mean       2.5%      97.5%
beta0   991.4007496 947.253200 1036.31532
beta[1]  12.8196529   4.235574   21.51116
beta[2]  -2.8747599  -3.308695   -2.44884
nu       33.4879111   3.455822  114.56115
sigma    31.6173398  24.425564   39.83589
lp__     -0.1425401  -4.083100    1.96456

It is clear that the slope of Spend is significantly positive and the slope of PrcntTake is significantly negative.

This shows that the negative correlation between SAT scores and the money spent, as seen in the scatterplot, is illusory: where fewer students take the SAT, those who do are the ones applying to college; students who would likely receive low SAT scores do not apply and do not take the test.
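
One quick way to see this in the raw data (a sketch, not part of the original analysis): adjust both SATT and Spend for PrcntTake and look at the partial (added-variable) relationship, which is positive.

resSAT   <- resid(lm(SATT ~ PrcntTake, data = myData))   # SAT total, adjusted for participation
resSpend <- resid(lm(Spend ~ PrcntTake, data = myData))  # spending, adjusted for participation
plot(resSpend, resSAT)
abline(lm(resSAT ~ resSpend))   # positive partial slope, matching the multiple regression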

Run MCMC for the model with additional nuisance predictors.

fit_noshrinkExtra <- sampling (RobustMultipleRegressionDso, 
                          data=dataListExtraPredict, 
                          pars=c('beta0', 'beta', 'nu', 'sigma'),
                          iter=5000, chains = 2, cores = 2)

Analyze the output with shinystan.

launch_shinystan(fit_noshrinkExtra)

Here are the results of MCMC.

stan_ac(fit_noshrinkExtra, separate_chains = T)

pairs(fit_noshrinkExtra,pars=c("beta0","beta[1]","beta[2]"))

plot(fit_noshrinkExtra,pars=c('beta'))

stan_dens(fit_noshrinkExtra,pars=c("beta0","beta"))

All densities look symmetric: the posterior means can be used as point estimates of the betas.

summary(fit_noshrinkExtra)$summary[,c(1,4,8)]
               mean       2.5%        97.5%
beta0    998.618086 952.442828 1044.8766561
beta[1]   10.234291   1.233829   19.2611967
beta[2]   -2.721493  -3.179022   -2.2554795
beta[3]    2.304234 -11.450917   15.7354294
beta[4]   -4.728814 -14.889133    5.4752469
beta[5]    6.250301  -7.252438   19.1997960
beta[6]   -5.696532 -15.225194    3.7955166
beta[7]    6.989249  -2.925123   16.8434431
beta[8]    1.908281  -8.072728   12.2999566
beta[9]    4.283368  -6.011922   14.5423902
beta[10]   2.651897 -11.337831   17.1438898
beta[11]  -4.221621 -13.803873    5.1213425
beta[12] -10.363432 -20.155584   -0.7626764
beta[13]  -1.894524 -12.602323    8.8656771
beta[14]   1.710674  -8.973928   12.1135095
nu        31.187292   2.027864  105.5120668
sigma     30.255439  21.367424   39.5905702
lp__       1.240241  -6.398344    6.6903196

The variables corresponding to betas are:

colnames(x)
 [1] "Spend"     "PrcntTake" "xRand1"    "xRand2"    "xRand3"   
 [6] "xRand4"    "xRand5"    "xRand6"    "xRand7"    "xRand8"   
[11] "xRand9"    "xRand10"   "xRand11"   "xRand12"  

Note that the coefficient for the variable Spend is still positive, but the left end of its HDI is much closer to zero. The coefficient for PrcntTake is still significantly negative.
One of the nuisance predictors happened to come out significantly negative: beta[12].

As a result of adding the nuisance predictors, the accuracy of the inference becomes lower.
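
A quick way to quantify this (a sketch reusing the two fits above) is to compare the widths of the 95% intervals for the Spend slope with and without the nuisance predictors:

wTwo   <- diff(summary(fit_noshrink2Pred)$summary["beta[1]", c("2.5%", "97.5%")])
wExtra <- diff(summary(fit_noshrinkExtra)$summary["beta[1]", c("2.5%", "97.5%")])
c(TwoPredictors = unname(wTwo), WithNuisance = unname(wExtra))  # the interval is wider with nuisance predictors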

Shrinkage

Analyze the same data with the model encouraging shrinkage of parameters.

First, fit the model without nuisance parameters.

fit_shrink <- sampling (RegressionShrinkDso, 
                        data=dataList2Predict, 
                        pars=c('beta0', 'beta', 'nu', 'sigma', 'sigmaBeta'),
                        iter=5000, chains = 2, cores = 2)

Check convergence in shiny.

launch_shinystan(fit_shrink)
pairs(fit_shrink,pars=c("beta0","beta","nu","sigma","sigmaBeta"))

plot(fit_shrink,pars=c('beta'))

stan_dens(fit_shrink,pars=c('beta'))

stan_ac(fit_shrink, separate_chains = T)

Compare with the fit without nuisance parameters and without shrinkage.

cbind(summary(fit_noshrink2Pred)$summary[1:4,c(1,4,8)],
      summary(fit_shrink)$summary[1:4,c(1,4,8)])
             mean       2.5%      97.5%       mean       2.5%
beta0   991.40075 947.253200 1036.31532 996.071254 951.012629
beta[1]  12.81965   4.235574   21.51116  11.770134   2.632324
beta[2]  -2.87476  -3.308695   -2.44884  -2.831416  -3.275607
nu       33.48791   3.455822  114.56115  33.323074   3.756878
              97.5%
beta0   1042.051358
beta[1]   20.514489
beta[2]   -2.361973
nu       109.178594

The first variable shrank closer to zero: the mean value is smaller and the left end of the 95%-HDI is closer to zero.

Now fit the model with additional parameters.

fit_shrinkExtra <- sampling (RegressionShrinkDso, 
                        data=dataListExtraPredict, 
                        pars=c('beta0', 'beta', 'nu', 'sigma', 'sigmaBeta'),
                        iter=5000, chains = 2, cores = 2)
stan_ac(fit_shrinkExtra, separate_chains = T)

pairs(fit_shrinkExtra,pars=c("beta0","beta[1]","beta[2]","beta[3]","beta[4]","beta[11]","beta[12]"))

pairs(fit_shrinkExtra,pars=c("nu","sigma","sigmaBeta"))

plot(fit_shrinkExtra,pars=c('beta'))

stan_dens(fit_shrinkExtra,pars=c('beta'))

Note the characteristic pinched peaks of the posterior densities for the shrunk variables.

summary(fit_shrinkExtra)$summary[,c(1:4,8)]
                   mean     se_mean         sd          2.5%
beta0     1009.82602144 4.473582390 26.5068325  9.628679e+02
beta[1]      8.21954511 1.078486389  5.2814904 -2.802567e-01
beta[2]     -2.68755852 0.033018532  0.2279845 -3.164852e+00
beta[3]      1.16429414 0.538365949  3.1397865 -4.187581e+00
beta[4]     -1.16002937 0.229745431  2.6666174 -8.523255e+00
beta[5]      1.78068242 0.492660992  3.0628568 -2.483201e+00
beta[6]     -2.47265483 1.152596733  3.7247172 -1.093501e+01
beta[7]      2.76863569 1.135585515  3.9366726 -1.659493e+00
beta[8]      0.86387877 0.267798818  2.3636293 -2.570624e+00
beta[9]      1.36957487 0.314802537  2.8913159 -2.288281e+00
beta[10]    -1.39643294 1.838822871  3.7474360 -1.012010e+01
beta[11]    -1.00165428 0.236339260  2.3005006 -7.569141e+00
beta[12]    -6.61450384 0.710490722  5.4495075 -1.794561e+01
beta[13]    -0.38037233 0.100166767  1.9872390 -5.869640e+00
beta[14]     0.76980004 0.466670211  2.3077738 -3.803671e+00
nu          40.07666808 5.561396626 34.6493271  3.403896e+00
sigma       30.81178901 0.343535668  3.8888139  2.332439e+01
sigmaBeta    0.01966483 0.002562242  0.0270782  5.482108e-04
lp__        21.13792207 0.824755562  6.4649353  9.719103e+00
                  97.5%
beta0     1058.28908450
beta[1]     18.23938239
beta[2]     -2.24414288
beta[3]      8.71009792
beta[4]      3.08755229
beta[5]      9.02094911
beta[6]      1.45812939
beta[7]     11.60701517
beta[8]      7.02230696
beta[9]      9.15217808
beta[10]     5.87438345
beta[11]     2.47309092
beta[12]     0.61143385
beta[13]     3.18343802
beta[14]     6.30005198
nu         129.97213782
sigma       38.72621370
sigmaBeta    0.09534239
lp__        35.39122557

Parameter beta[12] has shrunk to zero based on the 95%-HDI, as a result of the regularized model.
This helped remove all nuisance predictors. But shrinkage also removed parameter beta[1] of the variable Spend!

Linear model

Compare with linear model.

Without nuisance predictors:

lmSAT<-lm(y~x[,1]+x[,2])
summary(lmSAT)

Call:
lm(formula = y ~ x[, 1] + x[, 2])

Residuals:
    Min      1Q  Median      3Q     Max 
-88.400 -22.884   1.968  19.142  68.755 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 993.8317    21.8332  45.519  < 2e-16 ***
x[, 1]       12.2865     4.2243   2.909  0.00553 ** 
x[, 2]       -2.8509     0.2151 -13.253  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 32.46 on 47 degrees of freedom
Multiple R-squared:  0.8195,    Adjusted R-squared:  0.8118 
F-statistic: 106.7 on 2 and 47 DF,  p-value: < 2.2e-16
confint(lmSAT)
                 2.5 %      97.5 %
(Intercept) 949.908859 1037.754459
x[, 1]        3.788291   20.784746
x[, 2]       -3.283679   -2.418179

With nuisance predictors:

lmSATAll<-lm(y~.,data=as.data.frame(cbind(y,x)))
summary(lmSATAll)

Call:
lm(formula = y ~ ., data = as.data.frame(cbind(y, x)))

Residuals:
    Min      1Q  Median      3Q     Max 
-61.485 -17.643   1.093  15.349  64.549 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1002.601     22.231  45.100  < 2e-16 ***
Spend          9.503      4.313   2.203   0.0342 *  
PrcntTake     -2.703      0.228 -11.853 8.29e-14 ***
xRand1         2.164      6.842   0.316   0.7536    
xRand2        -5.190      5.015  -1.035   0.3078    
xRand3         6.424      6.599   0.974   0.3370    
xRand4        -5.678      4.899  -1.159   0.2542    
xRand5         7.363      4.957   1.485   0.1464    
xRand6         1.606      5.074   0.316   0.7536    
xRand7         3.909      5.034   0.777   0.4426    
xRand8         3.060      7.072   0.433   0.6679    
xRand9        -4.654      4.397  -1.058   0.2971    
xRand10      -10.265      4.816  -2.131   0.0401 *  
xRand11       -2.912      5.252  -0.555   0.5827    
xRand12        1.334      5.258   0.254   0.8013    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 31.51 on 35 degrees of freedom
Multiple R-squared:  0.8733,    Adjusted R-squared:  0.8226 
F-statistic: 17.23 on 14 and 35 DF,  p-value: 1.083e-11
confint(lmSATAll)[2:3,2]-confint(lmSATAll)[2:3,1]
     Spend  PrcntTake 
17.5101430  0.9258323 
confint(lmSAT)[2:3,2]-confint(lmSAT)[2:3,1]
    x[, 1]     x[, 2] 
16.9964551  0.8655003 

These also show that the addition of the nuisance predictors widened the confidence intervals.

Ridge and lasso

set.seed(15)
cv.outRidge=cv.glmnet(x=dataListExtraPredict$x,y=dataListExtraPredict$y,alpha=0)
plot(cv.outRidge)

(bestlam <-cv.outRidge$lambda.min)
[1] 6.570761
ridgeFit<-glmnet(x=dataListExtraPredict$x,y=dataListExtraPredict$y,
                 alpha=0,lambda=bestlam,standardize = F)
ridge.coef<-predict(ridgeFit,type="coefficients",s=bestlam)
set.seed(15)
cv.outLasso=cv.glmnet(x=dataListExtraPredict$x,y=dataListExtraPredict$y,alpha=1)
plot(cv.outLasso)

(bestlam <-cv.outLasso$lambda.min)
[1] 7.732551
lassoFit<-glmnet(x=dataListExtraPredict$x,y=dataListExtraPredict$y,
                 alpha=1,lambda=bestlam,standardize = F)
lasso.coef<-predict(lassoFit,type="coefficients",s=bestlam)
comparison<-round(cbind(summary(lmSATAll)$coefficients[,c(1,4)],
                        summary(fit_noshrinkExtra)$summary[1:15,c(1,4,8)],
                        summary(fit_shrinkExtra)$summary[1:15,c(1,4,8)],
                        ridge.coef, lasso.coef),3)
comparison<-as.matrix(comparison)
colnames(comparison)<-c("LM","LM-Pv","NoShrink","NoShrink-L","NoShrink-H",
                        "Shrink","Shrink-L","Shrink-H","Ridge","Lasso")
comparison
                  LM LM-Pv NoShrink NoShrink-L NoShrink-H   Shrink
(Intercept) 1002.601 0.000  998.618    952.443   1044.877 1009.826
Spend          9.503 0.034   10.234      1.234     19.261    8.220
PrcntTake     -2.703 0.000   -2.721     -3.179     -2.255   -2.688
xRand1         2.164 0.754    2.304    -11.451     15.735    1.164
xRand2        -5.190 0.308   -4.729    -14.889      5.475   -1.160
xRand3         6.424 0.337    6.250     -7.252     19.200    1.781
xRand4        -5.678 0.254   -5.697    -15.225      3.796   -2.473
xRand5         7.363 0.146    6.989     -2.925     16.843    2.769
xRand6         1.606 0.754    1.908     -8.073     12.300    0.864
xRand7         3.909 0.443    4.283     -6.012     14.542    1.370
xRand8         3.060 0.668    2.652    -11.338     17.144   -1.396
xRand9        -4.654 0.297   -4.222    -13.804      5.121   -1.002
xRand10      -10.265 0.040  -10.363    -20.156     -0.763   -6.615
xRand11       -2.912 0.583   -1.895    -12.602      8.866   -0.380
xRand12        1.334 0.801    1.711     -8.974     12.114    0.770
            Shrink-L Shrink-H    Ridge    Lasso
(Intercept)  962.868 1058.289 1005.862 1027.326
Spend         -0.280   18.239    8.932    5.032
PrcntTake     -3.165   -2.244   -2.697   -2.607
xRand1        -4.188    8.710    2.061    0.000
xRand2        -8.523    3.088   -5.024    0.000
xRand3        -2.483    9.021    5.004    0.000
xRand4       -10.935    1.458   -5.064    0.000
xRand5        -1.659   11.607    6.686    0.000
xRand6        -2.571    7.022    1.916    0.000
xRand7        -2.288    9.152    3.828    0.000
xRand8       -10.120    5.874    2.090    0.000
xRand9        -7.569    2.473   -4.497    0.000
xRand10      -17.946    0.611   -9.411   -4.142
xRand11       -5.870    3.183   -2.396    0.000
xRand12       -3.804    6.300    0.807    0.000

Note that ridge and lasso regressions do not provide any measure, such as confidence intervals, for comparing coefficients with zero.

- The linear model keeps both Spend and PrcntTake and removes, at the 5% level, all nuisance coefficients except xRand10.
- The Bayesian model without shrinkage does the same.
- The Bayesian model with shrinkage shrinks all artificial predictors to zero, but it also removes Spend.
- Ridge is broadly consistent with the linear model, but it is not clear whether it shrinks any parameters to zero or not. Lasso shrinks all artificial predictors to zero except xRand10.

Further reading

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hai-mn/hai-mn.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Nguyen (2022, Jan. 31). HaiBiostat: Series 4 of 10 -- Fitting Linear Models - Multiple Regression. Retrieved from https://hai-mn.github.io/posts/2022-01-31-Bayesian methods - Series 4 of 10/

BibTeX citation

@misc{nguyen2022series,
  author = {Nguyen, Hai},
  title = {HaiBiostat: Series 4 of 10 -- Fitting Linear Models - Multiple Regression},
  url = {https://hai-mn.github.io/posts/2022-01-31-Bayesian methods - Series 4 of 10/},
  year = {2022}
}